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Summary. This paper develops nonparametric methods for the survival analysis of epidemic 
data based on contact intervals. The contact interval from person i to person j is the time 
between the onset of infectiousness in i and infectious contact from i to j, where we define 
infectious contact as a contact sufficient to infect a susceptible individual. We show that the 
Nelson-Aalen estimator produces an unbiased estimate of the contact interval cumulative haz- 
ard function when who-infects-whom is observed. When who-infects-whom is not observed, 
we average the Nelson-Aalen estimates from all transmission networks consistent with the ob- 
served data using an EM algorithm. This converges to a nonparametric MLE of the contact 
interval cumulative hazard function that we call the marginal Nelson-Aalen estimate. We study 
the behavior of these methods in simulations and use them to analyze household surveillance 
data from the 2009 influenza A(H1 N1 ) pandemic. In an appendix, we show that these methods 
extend chain-binomial models to continuous time. 
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1. Introduction 

Infectious diseases remain one of the greatest threats to human health and commerce, and 
the analysis of epidemic data is one of the most important applications of statistics in public 
health. If person i infects person j with a given disease, the generation interval is the time 
between the infection of i and the infection of j. The serial interval, which is often used 
as a proxy for the generation interval, is the time between the onset of symptoms in i 
and the onset of symptoms in j. Data from several recent and historical epidemics have 
been analyzed using methods based on generation or serial interval distributions, including 
the 1918 influenza (Mills et al., 2004), severe acute respiratory syndrome (SARS) (Lipsitch 
et al., 2003; Wallinga and Teunis, 2004), pandemic influenza A(H1N1) (Fraser et al., 2009; 
McBryde et al., 2009; Yang et al., 2009), and avian influenza (Ferguson et al., 2005, 2006). 

Though the generation and serial interval distributions are often considered invariant 
features of an infectious disease (Fine, 2003), these distributions can change systematically 
over the course of an epidemic (Svensson, 2007; Kenah et al., 2008). When a susceptible 
person is exposed to multiple infectious people, his or her infector must be the first person 
to make infectious contact. Thus, the mean generation and serial intervals contract as the 
prevalence of infection increases. When transmission is rapid within groups of close contacts 
such as households or schools, this can occur even when the global prevalence of infection 
remains low (Kenah et al., 2008). 

Statistical methods for infectious disease data that use generation or serial interval dis- 
tributions are based on the Lotka-Euler equation (Wallinga and Lipsitch, 2007; Roberts 
and Heesterbeek, 2007) or a branching-process approximation to the early spread of dis- 
ease (Wallinga and Teunis, 2004; White and Pagano, 2008). In both of these approaches, 
epidemic spread is modeled as a process that creates a population rather than a process 
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of percolation through an existing population. Thus, these methods fail to account for 
the effects of exposure to multiple infectors and ignore information contributed by unin- 
fected person-time. Since they assume generation intervals are independent and identically 
distributed, they implicitly assume a constant latent period (infection to onset of infec- 
tiousness) and a constant infectious period. When serial intervals are used as a proxy for 
generation intervals, they implicitly assume that the incubation period (infection to onset 
of symptoms) is also constant. 

Kenah (2010) outlined an alternative analysis of epidemic data based on contact inter- 
vals. Informally, the contact interval from an infectious person i to a susceptible person 
j is the time between the onset of infectiousness in i and the first infectious contact from 
i to j, where we define infectious contact to be a contact sufficient to infect a susceptible 
individual. The contact interval is similar to the generation interval, but it is defined for 
all possible infector-infectee pairs, not just those in which infection is actually transmitted, 
and it begins with the onset of infectiousness, not infection. The contact interval from i 
to j will be right-censored if j is infected by k ^ i prior to infectious contact from i or 
if i recovers before making infectious contact with j. This censoring is accounted for us- 
ing methods from survival analysis. Statistical methods based on contact intervals avoid 
many of the problems with methods based on generation and serial intervals. They can 
incorporate a greater variety of transmission models, they use information contributed by 
uninfected person-time, and they allow clearer expression of analytical assumptions. 

In epidemic modeling, it is common to specify infectious contact within an underlying 
contact process between individuals in the population. While a person i is infectious, he 
or she has some probability of transmitting infection to a susceptible person j each time a 
contact from i to j is made (e.g., Rhodes et al., 1996). Models based on contact intervals 
define an infectious contact as a contact sufficient to infect a susceptible individual, ignore 
all contacts made by person i outside his or her infectious period, and ignore all infectious 
contacts from i to j after the first. This relaxes the assumption that the contact process 
is unaffected by infection and limits our attention to those contacts relevant to disease 
transmission. Any model specified in terms of an underlying contact process can be specified 
cquivalcntly in terms of contact intervals. 

This paper outlines the nonparametric survival analysis of epidemic data based on con- 
tact intervals. The rest of Section 1 defines the general stochastic Susceptible-Exposed- 
Infcctious- Removed (SEIR) model, describes our observed data, and reviews the parametric 
survival analysis of epidemic data. Section 2 shows that the Nelson- Aalen estimator pro- 
duces an unbiased estimate of the cumulative hazard function of the contact interval distri- 
bution when who-infected-whom is observed. When who-infectcd-whom is not observed, the 
Nelson- Aalen estimates from all transmission networks consistent with the observed data are 
averaged to get a marginal Nelson-Aalen estimator. Since the transmission network prob- 
abilities are unknown, an expectation-maximization (EM) algorithm is used to iteratively 
reweight the possible transmission networks, producing a series of marginal Nelson-Aalen 
estimates that converges to a nonparametric maximum likelihood estimate (MLE) of the 
contact interval cumulative hazard function. Section 3 explores the performance of these 
estimators in simulations. Section 4 uses them to analyze household surveillance data from 
the 2009 influenza A(H1N1) pandemic. Section 5 discusses the limitations, advantages, and 
future development of these methods. Appendix A shows how our methods can be used 
when who-infected-whom is partially observed, and Appendix B shows how our methods 
generalize chain-binomial models to continuous time. 
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1.1. Stochastic SEIR model and observed data 

Consider a closed population of n individuals assigned indices 1 . . . n. Each individual is 
in one of four possible states: susceptible (S), exposed (E), infectious (I), or removed (R). 
Person i moves from S to E at his or her infection time ti , with U — oo if i is never infected. 
After infection i has a latent period of length Ei , during which he or she is infected but not 
infectious. At time ti + £j, i moves from E to I, beginning an infectious period of length 
Li. At time ti + r iy where = e t + tj is the recovery period, i moves from I to R. Once in 
R, i can no longer infect others or be infected. The latent period is a nonnegative random 
variable, the infectious and recovery periods are strictly positive random variables, and the 
recovery period is finite with probability one. 

An epidemic begins with one or more persons infected from outside the population, 
which we call imported infections. For simplicity, we assume that epidemics begin with one 
or more imported infections at time and there are no other imported infections. 

After becoming infectious at time ti + £j, person i makes infectious contact with j ^ i 
at time tij = ti + £, + r 2 * , where the infectious contact interval t*j is a strictly positive 
random variable with t* = oo if infectious contact never occurs. Since infectious contact 
must occur while i is infectious or never, t*j <E (0, ij] or t*^ = oo. We define infectious 
contact to be sufficient to cause infection in a susceptible person, so tj < tij . 

For each ordered pair ij, let CV, = 1 if infectious contact from i to j is possible and 
Cij = otherwise. We assume that the infectious contact interval t*- is generated in the 
following way: A contact interval is drawn from a distribution with hazard function 
Ajj(r). If Tij < ^ and Cij = 1, then r t * = r^. Otherwise, T*j = oo. In this paper, we 
assume that all contact intervals have a continuous distribution and, for a fixed i or a fixed 
j, the contact intervals t^ , j ^ i, are independent. 

The parametric methods in Kcnah (2010) are derived using counting processes and mar- 
tingales, which are described in Kalbfleisch and Prentice (2002) and Aalen et al. (2009). Let 
Si(t) = lt<u &ndli(t) — l t€ ( t . +e . it . +r .] be the susceptibility and infectiousness processes for 
person i, where lx = 1 if A is true and zero otherwise. These processes are left-continuous 
and infectious contact from i to j is possible at time t only if CijIi(t)Sj(t) = 1. 

Following Wallinga and Teunis (2004), let Vj denote the index of the person who infected 
person j, with Vj = for imported infections and Vj = oo for persons not infected at or 
before time T. The transmission network is the directed network with an edge from Vj to 
j for each j such that tj < T. It can be represented by a vector v = (v\, . . . ,v n ). Let 
Vj = {i : Cijliitj) = 1} denote the set of persons who could have infected person j, which 
we call this the infectious set of person j. Let V denote the set of all possible v consistent 
with the observed data. 

Our population has size n, and we observe the times of all S — > E (infection), E — > I 
(onset of infectiousness), and I — > R (removal) transitions in the population between time 
and time T. For all ordered pairs ij such that i is infected, we observe Cij. Let m denote 
the total number of infections we observe. Our goal is to estimate the contact interval 
distribution via its cumulative hazard function. 

1.2. Parametric survival analysis of epidemic data 

We begin by reviewing the parametric survival analysis of epidemic data (Kcnah, 2010), 
where we assume that \j (r) comes from a family of hazard functions indexed by a parameter 
vector with a unique 6>o such that A^(t) = A(r;#o) f° r a U ij- To guarantee that the 
maximum likelihood estimates are consistent and asymptotically normal, we assume that, 
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for all t, A(r; 9) and In A(r; 9) have continuous third derivatives with respect to 9 in an open 
neighborhood of 9$. 



1.2.1. Likelihood when who-infccts-whom is observed 

Let Nij(t) — lt>t y count the first infectious contact from i to j. By definition, Nij{t) is 
right-continuous with left-hand limits (cadlag). We count only the first infectious contact 
because j is infected at or before Since Ay (r) = A(r; O ) an d A/ij(0) = 0, the process 



M tJ ((; 9) = Mij{t) - [ X(u-U- e t ; 9)C ij l i {u)\ u < Uj du 
Jo 



(1) 



is a zero- mean martingale when 9 = 9 a . Since Sj(u)1 u <t is predictable, the stochastic 
integral 

M ij (t;0) = / S^ujWdMy^J) (2) 
Jo 

is a zero- mean martingale when 9 = 9 n . The corresponding counting process is 

%(*)=/ 5 j (ti)l u < r dA/" <J '(u), (3) 
Jo 

which counts infectious contacts from z to j while j is susceptible and before time T. The 
corresponding log likelihood is 

lij{6)= [ In X(u-t i -e i ;e)S j (u)dN ij (u)- [ X(u - U - e t -9)C t3 l t {u)S {u) du, (4) 



which has the score process 

U ij (t;0)=J ^ln\(u-t i -e i ;6)dM ij (u;6). (5) 

Since it is the integral of a predictable process with respect to a zero-mean martingale, 
Uij(t;9 ) is a zero-mean martingale. If tj < T, this likelihood requires that we observe 
whether or not i made infectious contact with j at time tj . 

Now fix j. If we observe infectious contacts in all ordered pairs ij from time until time 
T, the log likelihood is 

e. j (9) = '££ ij (6), (6) 

with the score process 

U.j{t; 0)=£E/«(t;0). (7) 

U.j(t; 9q) is a zero-mean martingale because it is a sum of zero-mean martingales. If tj < T, 
the likelihood in equation (6) requires that we observe which counting process Nij jumped 
at tj, which is equivalent to observing which i infected j. The complete-data log likelihood 
when we observe who-infects-whom is £(9) — X^j=i ^ j(^) with the score process U(t; 9) = 
Y^j=i U.j(t; 9). Clearly, U(t; 9 ) is a zero-mean martingale. 
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1.2.2. Likelihood when who- infects- whom is not observed 

Now assume that we do not observe who infected person j. The corresponding counting 
process is 

N. j (t)=J^N ij (t), (8) 

and the process M.j(t; 9) — J2i^j Mij(t; 9) is a zero-mean martinale when 9 = 9 . Expand- 
ing M.j(t; 9), we get 

M.j(t; 9) = N.j(t) - f X.j(u; 9)S 3 {u)\ u < T du, (9) 
Jo 

where 

\. j (t;9) = ^2\{t-t i -e i ;0)Ci j X i (t), (10) 

is the total hazard of infectious contact with j at time t given 9. When person j is observed 
from time to time T, the log likelihood is 

£.j(0)= ( lnX. j (u;6)dN. j (u)- [ X. j (u;6)S j (u) du (11) 
Jo Jo 

with the score process 

U-j(t;e) = jT^lnA^u; 9)dM. J {u-9). (12) 

U.j(t; 9 ) is a zero-mean martingale because it is the integral of a predictable process with 
respect to M.j(t; 9 ). The log likelihood for the complete observed data when we do not ob- 
serve who-infected-whom is £(9) = Y^j=i £-j(9) with score process U(t: 9) = Y^j=i U.j(t; 9), 
which is a zero- mean martingale when 9 = 9q. 



2. Methods 

In this section, we extend the Nelson- Aalen estimator (Altshuler, 1970; Nelson, 1972; Aalen, 
1978) to derive a nonparametric estimators of the cumulative hazard function of the con- 
tact interval distribution. We continue to assume that the contact interval distribution is 
continuous, but we drop the dependence of the hazard function on a parameter vector 9, so 
it is simply A(t). Let 

A(t) = f X(u) du (13) 
Jo 

denote the cumulative hazard function. We call r the infectiousness age. To derive para- 
metric estimators of the contact interval distribution, we used counting processes and mar- 
tingales defined in absolute time. To derive nonparametric estimators, we use counting 
processes and martingales defined in infectiousness age. Whereas absolute time processes 
arc defined for all ordered pairs ij, infectiousness age processes are defined only for those 
ij in which < oo. In our notation, we use the argument t for absolute time and t for 
infectiousness age. Martingales and counting processes in infectiousness age are given an 
asterisk superscript. 
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We show that the Nelson- Aalen estimator generates an unbiased estimate of A(r) when 
who-infects-whom is observed. When who-infects-whom is not observed, we average the 
Nelson-Aalen estimates from all v e V to get a marginal Nelson-Aalen estimator. Since 
the probability of each v G V is unknown, we use an EM algorithm (Dempster et al., 
1977) to iteratively reweight each possible v, obtaining a sequence of marginal Nelson- 
Aalen estimates that converges to a nonparametric MLE of A(t). Finally, we show how 
these estimates can be approximated in mass-action SEIR models using only data about 
infected persons. In Appendix A, we show how these methods adapt to a situation where 
the transmission network is partially observed. 



2. 1. Nonparametric estimation when who-infects-whom is observed 

First, we assume that we observe who infected whom. For each person i such that ti+Ei < T, 
let Af*j(r) = l T >Tij count the first infectious contact from i to j at or before infectious- 
ness age r in person i, and let I*(t) = l re (o.ti] indicate whether i remains infectious at 
infectiousness age r. By definition, A/^*(t) is cadlag and Ti{r) is left-continuous. Since 
M,(0) = 0, 

M*. (r) = m (r) - f \{u)C l0 T* («) du (14) 



is a zero- mean martingale. Now let S*j (r) = Sj (ti + + r) indicate the susceptibility of 
person j at infectiousness age r of person i and let % = T — ti — Si denote the time between 
the onset of infectiousness in i and the end of observation. Since Sj*(t)1 t <7- is predictable, 

M*j(T)= f S?.( U )l u<ri dM?.( U ) (15) 
Jo 

is a zero-mean martingale corresponding to the counting process 



rS£(«)l u < Ti dA/3(u), 
Jo 



N* j (r)= I S£(«)l u < ri dA/3(u), (16) 



which counts infectious contacts from i to j prior to time T while j remains susceptible. 
Now fix j. Since it is a sum of zero-mean martingales, 

M* (r) = £ M*j (t) = N* (t) - T X(u)Y.j(u) du, (17) 

i:ti+E z <T J ° 

is a zero-mean martingale, where 

N *j(T)= E W - ^> tl -t Vi (is) 

i:ti+6i<T 

is a counting process that jumps at the infectiousness age at which vj (the infector of j) 
makes infectious contact with j and 

Y.j(r)= Yl ^*(r)5*.(r)l r < ri (19) 

i:ti+£i<T 

denotes the number of persons who could have infected j at infectiousness age r. Note that 
Y.j(t) is left-continuous and decreasing in r. 
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Finally, consider the combined observations of all j. Since it is a sum of zero- mean 
martingales, 



M* (t) = V M* (r) = N* (t) - / \(u)Y(u) du 

3 = 1 J ° 



(20) 



is a zero-mean martingale, where 



iV(r)=^7V.*(r) (21) 
i=i 



counts the number of observed infectious contacts with susceptible individuals occuring at 
infectiousness age < r and 

n 

Y(r) = J2y-j(r) (22) 
i=i 

denotes the number of contact intervals of length > r that were observed. Like each Y.j(r), 
Y(t) is left-continuous and decreasing in r. 

We are now ready to show that the standard Nelson- Aalen estimator will produce an 
unbiased estimate of A(r). Let 



Hr) = I -fg^ dN*(u) (23) 

be the Nelson-Aalen estimate of A(r) and let T = max{r : Y(t) > 0}. Then 

A(r) - A(t AT)=| T dM*(u), (24) 

is a zero-mean martingale, where r A T = min(r, T) . Therefore, A(t) is an unbiased 
estimate of A(r) for all re [0, T]. For these r, a variance estimate for A(r) — A(r) is given 
by its optional variation process 

t 2 (r)= r^r^dNiu). (25) 



J, Y{uf 

Using the martingale central limit theorem and a log transformation, we get the approximate 
pointwise 1 — a confidence limits 

A(, ) exp( ± ^(l-f)), (26) 

where $ is the cumulative distribution function of the standard normal distribution. Note 
that the point estimate A(t) in equation (23) is valid for an arbitrary contact interval dis- 
tribution but the variance estimate <t 2 (t) in equation (25) and the approximate confidence 
interval in equation (26) assume a continuous contact interval distribution. 



2.2. Nonparametric estimation when who-infects-whom is not observed 

Now assume that we do not observe who-infected-whom. We can no longer calculate the 
Nelson-Aalen estimate in equation (23) because we do not know which of the contact inter- 
vals are censored. However, 



A(r)=E[A(r)] =E E[A(r) j observed data] 



(27) 
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by the law of iterated expectation, so we can still obtain an unbiased estimate of A(t). 
Let A v (r) denote the value of A(r) that we would have calculated had we observed the 
transmission network v. Then 

A(t) = ^2 A v( T ) Pr(v|obscrvcd data) (28) 
vev 

is an unbiased estimate of A(r) for each r £ [0,7"]. For each v <G V, Pr(v|observed data) 
can be calculated if we know A(t) (Kenah et al., 2008). For each person j with tj < oo, the 
probability that j was infected by person i € Vj is 

Pij = — ~ ~ ~ £i ' ) (29) 

and the probability of a given transmission network v = (v\, . . . ,v n ) is 

Pr(v|observed data) = ] [ p Vj j, (30) 

j: tj <oo 

so the infector vj E Vj of each infected j can be chosen independently of all Vi, i 7^ j. 
Note that equations (29) and (30) assume a continuous contact interval distribution, which 
ensures that simultaneous infectious contacts have probability zero. 

With a known or estimated A(t), it is easy (but unnecessary) to calculate a marginal 
Nelson- Aalen estimate. We outline this calculation and then use it as the basis of an EM 
algorithm that converges to a nonparametric MLE of A(t). First, let 

N* (t\ A) = E [ N* (t) I observed data, A ] , (31) 

which is a cadlag step function with a jump of size at r = tj — U — Si for each i £ Vj. 
Now let 

n 

A>(t|A)=^A>.(t|A). (32) 

By equations (28) through (30), the marginal Nelson- Aalen estimate given A(r) is 

MT\\)=J o T ^^dN(u\\). (33) 

If A(t) is the true contact interval hazard function, this is an unbiased estimate of A(t) 
for each t e [0,7"]. The whole point of calculating A(r) is that A(r) is unknown, but 
equation (33) can be used in the following EM algorithm: 

(a) Begin with an initial \(°\t). This could be a parametric estimate or a constant. Use 
this to calculate an initial marginal Nelson- Aalen estimate A(t|A' '). 

(b) Use A(r|A( fc )) to calculate a new hazard function estimate A' fc+1 ^(r). 

(c) Update the probabilities in equation (29) using A^ fc+1 ^(r). 

(d) Use the updated probabilities to calculate A(T|A( fc+1 )) using equation (33). 

(e) Repeat the smoothing step (b), the E-step (c) and the M-step (d) until the sequence 
(A(r|A( fc ))) converges. 
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The limit of this sequence is the marginal Nelson- Aalen estimate, which we denote A(r). 
To show that this is truly an EM algorithm, we must prove that A(r|A( fe ') maximizes the 
expected log likelihood given \^(t). To do this, we adapt the argument of Kaplan and 
Meier (1958) that their product-limit estimate of the survival function is a nonparametric 
MLE. 

Let n , . . . , tm denote the distinct infectiousness ages at which infectious contacts might 
have been made, and let S(t) denote an arbitrary survival function for the contact interval 
distribution. The conditional probability of failure (i.e., infectious contact) at infectiousness 
age Tj given survival (i.e., no infectious contact) until Tj is 

V, = ^ (34) 
S( Tj ) 

where S(t^) = lim T -f- Tj . S(t) exists because S(t) is cadlag. When who-infects-whom is not 
observed, the expected log likelihood given A« is 

M 

G(S\\W) = (inpidJVTfolAW) +Hl-p j )[Y(r j ) dN*( Tj \X^)]). (35) 
i=i 



The j th term of G(S) is maximized by 



diV*(7-j|A< fc )) 



so the survival function that maximizes G(S) is 



S(r)= ]J (i-W) ( 37 ) 

= 7rS(l-i^dAr(«|AW)), (38) 

where 7T represents the product integral described in Kalbflcisch and Prentice (2002) and 
Aalen et al. (2009). The Nelson- Aalen estimate corresponding to the Kaplan-Meier estimate 
in equation (38) is 

A( T |A«)= r%^ dA r (r | A (fc)). (39) 
Jo Y(u) 

Therefore, each iteration of the proposed EM algorithm maximizes the expected log like- 
lihood, so the marginal Nelson-Aalen estimate is a nonparametric MLE of A(r). The 
corresponding marginal Kaplan-Meier estimator is a nonparametric MLE of S(t). 

The variance (? 2 (t) of A(r) can be estimated using the conditional variance formula. 
Conditioning on the transmission network v, we get 

a 2 (T)=E[^(T)] +Var(A v (r)), (40) 



where A v (r) is the Nelson-Aalen estimate from equation (23) and &^(t) is the variance 
estimate from equation (25) that we would have calculated had we observed the transmission 
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network v. Let A(r) denote the hazard function estimate corresponding to A(r), and let 
N*(t) = N*(t\X). The first term of equation (40) reduces to 

E[ ^ )]= [T(f #(,i) ' (41) 



and the second term reduces to 



where N*(t) = N*(t\\). Therefore, 

j . i j < oo 

Using the martingale central limit theorem and a log transformation, we get the approximate 
pointwise 1 — a confidence limits 

A M exp( ± |a*-(l-f)). (44) 
2.3. Estimating the hazard function 

Obtaining A^ fe+1 ' (r) from A(r|A^ fe ^) in step (b) of the EM algorithm is nontrivial. In general, 
the increments of A(t|A^- ) ) cannot be used directly as A^ fe+1 ^(r). To see this, consider a 

(k) 

person j with tj < T. For each person i <G Vj, let p\^' denote the estimated probability that 
i infected j in the k th iteration of the EM algorithm, with p\^ denoting the initial estimate. 
In the first iteration, we get p\^ oc /Y(rjj). After k iterations, 

(0) 



„( fe+1 ) oc — ^ (45) 



Since Y(t) is decreasing in r, the EM algorithm would converge to a marginal Nelson- Aalen 
estimate where the i e Vj with the longest is assigned probability one of being vj . This 
problem arises because equation (29) assumes a continuous contact interval distribution, 
which ensures that simultaneous infectious contacts have zero probability. Since the contact 
interval distribution is continuous, the intervals between the increments of A(t|A^^) contain 
information about A(r) that is ignored if these increments are used directly as an estimate 
of A(t). Thus, obtaining A( fe+1 >(r) from A(t|AW) requires some form of smoothing. 

The algorithm depends on the smoothed estimate of A(t) only to calculate the probabil- 
ities pij in equation (29). In our experience, cubic smoothing splines and kernel smoothers 
produce nearly identical results, so the algorithm seems insensitive to reasonable choices 
among smoothers. 



2.4. Approximate Nelson-Aalen estimates for mass-action SEIR models 

In a mass-action model, CV, = 1 for all ordered pairs ij but the hazard of infectious contact is 
inversely proportional to the population size n. More specifically, if A„(r) is the cumulative 
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hazard function of the contact interval distribution in a population of size n, then 

A»(r) = (46) 

where we call A* (r) the normalized cumulative hazard function of the contact interval dis- 
tribution. With knowledge of n, the methods of the previous two sections could be used to 
estimate A„(t) and A*(r). In Kenah (2010), it was shown that the full parametric likeli- 
hoods for mass-action SEIR models can be approximated by likelihoods that depend only 
on data about infected individuals. Here, we derive approximate nonparametric estimates 
of A*(t) that depend only on data about infected individuals. For a fixed number m of 
infections, these approximations become exact in the limit as n — > oo. Thus, they can be 
used to analyze data from the early stages of an epidemic, when m <C n. 

Let V*(t) = J2i m T )^-T<% denote the number of infected persons who remain infectious 
and under observation at infectiousness age r. By equations (19) and (22), 

Y (r)= E (/*(t)W<£S*,(t)). (47) 

i:U+ei<T j-.j^i 

Since n - m < S *j( T ) < n - 1 for all r e (0,71 and a11 h 

(n - m)y,(r) < Y(t) < (n - 1)Y,(t). (48) 

Let A„(t) be the Nelson- Aalen estimate of A(r) obtained in a population of size n when 
who-infects-whom is observed, and let 

L(t)= f%M^dAr». (49) 



r.(u) 

Since n^> m, Y(u) > if and only if Y*(u) > 0. Thus, 

n — \ - 

A,(t) < (n - 1)A„(t) < A(r). (50) 

n — m 

by equations (23) and (48). Therefore, A*(r) — > (n— l)A„(r) for a fixed m as n — > oo. Since 
A(t) is a consistent estimator of A n (r), this implies that A*(r) is a consistent estimator of 
A*(r) if we take limits first as n — > oo and then as m — > oo. Similarly, 

^'Iw'" 1 (51) 

is a consistent estimate of the variance of A*(t). Using the martingale central limit theorem 
and a log transformation, we get the asymptotic 1 — a confidence limits 

A^expfif^-^l-f)), (52) 

where we take limits first as n — > oo and then as m — > oo. 

When who-infects-whom is not observed, we have the approximate marginal Nelson- 
Aalen estimate 

A.(t|A.) = [ T 1y ' { J }> " dN*(u\\*). (53) 
Jo Y *( u ) 
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Since the probabilities pij in equation (29) depend on A^ fc+1 ^ (r) only up to a multiplicative 
constant, they can be estimated by smoothing the normalized cumulative hazard function 
A*(T|Al fe '). Thus, A*(r|A*) can replace A(r|A) in the EM algorithm from Section 2.2. Let 
A*(r) denote the marginal Nelson- Aalen estimate to which this algorithm converges, let 
A*(t) denote the corresponding normalized hazard function, and let N*(t) — JV*(r|A*). 
The variance of A* (r) is 

* <*> = 2 r w - § ( r w (m » 

where N*(t) = iV*-(r|A»). The asymptotic 1 — a confidence limits are 




3. Simulations 

To evaluate the performance of the methods from Section 2, we conducted simulations of 
network-based and mass-action epidemics. In each simulation, we used data on the first 
1,000 infections from an epidemic in a population of 100,000 to calculate the Nelson- Aalen 
and marginal Nelson- Aalen estimates of A(t) and the Kaplan- Meier and marginal Kaplan- 
Meier estimates of the corresponding survival function. For each of these estimators, we 
looked at whether the confidence interval contained the true value of the estimated function 
at the 5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles of all possible (i.e., censored 
and uncensored) contact intervals. After 1,000 simulations, we calculated the coverage 
probability for each estimator at each quantile. For each coverage probability, we calculated 
an exact 95% confidence interval. 

All models had a latent period of zero and an exponential infectious period with mean 
one. The smoothing step of the EM algorithm was done via an inverse-variance weighted 
cubic smoothing spline with a smoothing parameter chosen by generalized cross-validation 
(Wegman and Wright, 1983). Convergence of the EM algorithm was monitored by calcu- 
lating the mean of the absolute values of the differences between the current and previous 
marginal Nelson- Aalen estimates at the 5 th , 10 th , 15 th ,. . . , 85 th , 90 th , and 95 th percentiles 
of the possible contact intervals. Informally, we call this the "LI difference". All EM al- 
gorithms began by assuming that all possible transmission networks were equally likely, 
which is equivalent to assuming an exponential contact interval distribution. We ran the 
algorithm for a minimum of 5 iterations, and it was continued until an LI difference less 
than a specified tolerance was achieved or the 50 th iteration was completed. 

All epidemic models were written in Python (www . python . org) using the NumPy and 
SciPy packages (www.scipy.org). For network-based models, our contact networks were 
generated using the NetworkX package (networkx . lanl . gov). Statistical analysis was con- 
ducted in R (www.r-project.org) via the RPy2 package (rpy.sourceforge.net). The 
code for the models and estimators is available as Online Supplementary Information. 

3.1. Network-based simulations 

In all network-based simulations, the contact network was a Watts-Strogatz small-world 
network (Watts and Strogatz, 1998), which mimics the high clustering and low diameter 
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of real human contact networks. Starting with a ring of 100,000 nodes, each node was 
connected to its 10 nearest neighbors. Each edge was then rewired to a randomly chosen 
node with probability .1. Thus, 34.9% of nodes are connected only to their ten nearest 
neighbors, 38.7% are have one long-distance edge, 19.4% have two long-distance edges, 
5.7% have three long-distance edges, and 1.3% have four or more long-distance edges. This 
network structure gave us high clustering, so infected nodes typically had several possible 
infectors. A new contact network was built for each simulation, so the results do not reflect 
the idiosyncrasies of any particular realization of the network. 

We used Weibull(.5, 1), cxponential(l), and Weibull(2, 1) contact interval distributions, 
where Weibull(s, r) denotes a Weibull distribution with shape parameter s and rate param- 
eter r. The corresponding cumulative hazard functions are: 



Note that the exponcntial(l) distribution is the same as a Weibull(l, 1) distribution. The 
hazard of infectious contact decreases throughout the infectious period in the Weibull(.5, 1) 
case, remains constant in the exponential(l) case, and increases throughout the infectious 
period in the Weibull(2, 1) case. 

The LI tolerance for the EM algorithm was set to .0005. This was approximately the 
minimum tolerance consistently achieved by the EM algorithm before the LI differences 
became a chaotic sequence of small numbers. 



Results The EM algorithm for the marginal Nelson-Aalen estimate converged easily in 
all simulations. For models with exponential(l) contact intervals, the desired tolerance 
was achieved within 5 iterations for all models. For Weibull(.5, 1) contact intervals, the 
maximum number of iterations was 6. For the Weibull(2, 1) contact intervals, the maximum 
number was 8. Table 1 shows the coverage probabilities and exact 95% confidence intervals 
for the Nelson-Aalen, marginal Nelson-Aalen, Kaplan-Meier, and marginal Kaplan-Meier 
estimators in models with Weibull(.5, 1) and Weibull(2, 1) contact interval distributions. 
Coverage probabilities for the Nelson-Aalen and Kaplan-Meier estimators was close to .95 
across the range of available data for both models. Coverage probabilities for the marginal 
Nelson-Aalen and Kaplan-Meier estimators was slightly lower but above .90 in all cases. 
Coverage probabilities for the model with exponential contact intervals were above .93 for 
all estimators and data quantiles (not shown). Figures 1 and 2 show good agreement 
between the estimated and true cumulative hazard and survival functions for the contact 
interval distribution across the range of available data. The true cumulative hazard function 
(top) and survival function (bottom) are indicated by dashed lines. The Nelson-Aalen and 
Kaplan- Meier estimates (left) use information on who-infected-whom. The marginal Nelson- 
Aalen and Kaplan-Meier estimates (right) do not use this information. The point clouds 
are the point estimates at the 5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles of the 
possible contact intervals in each of 1,000 simulations. The clouds for the 5 th and 10 th and 
the 90 th and 95 th percentiles often run together, leaving five separate clouds of points. 



A(r) 
A(r) 
A(r) 



\pr for the Weibull(.5, 1) distribution 
r for the exponential(l) distribution, and 
t 2 for the Weibull(2, 1) distribution. 
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3.2. Mass-action simulations 

For mass-action models, we used Weibull(.5, 5), exponcntial(2), and Weibull(2, 1) normal- 
ized contact interval distributions. The corresponding cumulative hazard functions are: 



In a mass-action model, a Wcibull(.5, 1) distribution for the normalized contact interval 
distribution produces Rq — .89, so no epidemics occur. Changing the rate parameter to 
5 produces Rq — 1.98, which is very close to the i?o = 2 of the other two models. These 
are Ro estimates assuming a network with no clustering (Kenah, 2010), which places an 
upper bound on the true Ro in networks with clustering. The hazard of infectious contact 
decreases throughout the infectious period in the Weibull(.5, 5) case, remains constant in 
the exponential 1) case, and increases throughout the infectious period in the Weibull(2, 1) 
case. 

The LI tolerance for the EM algorithm was set to .005. This was approximately the 
minimum tolerance consistently achieved by the EM algorithm before the LI differences 
became a chaotic sequence of small numbers. 

Results The EM algorithm for the marginal Nelson- Aalen estimate took longer to converge 
for mass-action models than for network-based models, despite the greater tolerance. For 
models with exponential contact intervals, the maximum number of iterations required was 
19. For Weibull(2, 1) contact intervals, the maximum was 29. For Weibull(.5, 5) contact 
intervals, the maximum was 32. Table 2 shows coverage probabilities and exact 95% confi- 
dence intervals for the Nelson-Aalen, marginal Nelson-Aalen, Kaplan-Meier, and marginal 
Kaplan-Meier estimators in models with Weibull(.5, 5) and Weibull(2, 1) normalized con- 
tact interval distributions. Coverage probabilities for the Nelson-Aalen and Kaplan-Meier 
estimators were close to .95 across the range of available data for both models. In con- 
trast, the marginal Nelson-Aalen and Kaplan-Meier estimators were spectacular failures. 
In the model with exponential normalized contact interval distributions, coverage prob- 
abilities for Nelson-Aalen and Kaplan-Meier estimators were all above .93 and coverage 
probabilities for the marginal Nelson-Aalen and Kaplan-Meier estimators were all above 
.99. Figures 3 and 4 show good agreement between the estimated and true survival and cu- 
mulative hazard functions when who-infected-whom is observed (left) but poor agreement 
when who-infected-whom is not observed (right). The dashed lines and clouds of points 
have the same interpretation as in Figures 1 and 2. 

The number of possible infectors for each infection is much, much larger in a mass- 
action model than in a network-based model. It seems likely that the EM algorithm cannot 
usefully assign probabilities to the possible infectors of each infected person because the 
asymptotic approximation to the marginal Nelson-Aalen estimate is not sufficiently precise 
and because of limited numerical precision. In the case of exponential contact intervals, the 
initial step of the EM algorithm correctly guesses that all possible infectors are equally likely 
to have been the source of infection, so the EM algorithm converges. The large coverage 
probabilities suggest that the variance approximation from equation (54) overestimates the 
true variance. 



A.(r) 
A»(r) 
A„(t) 



VEt for the Weibull(.5, 5) distribution 
r for the exponential(l) distribution, and 
t 2 for the Weibull(2, 1) distribution. 
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4. Data analysis: Pandemic influenza A(H1N1) in Los Angeles County, 2009 

In this section, we use the marginal Nelson- Aalen estimator to analyze pandemic influenza 
A(H1N1) (pHlNl) surveillance data collected by the Los Angeles County Department of 
Public Health between April 22 and May 19, 2009. The data was collected according to the 
following protocol (Sugimoto et al., 2011): 

(a) Individuals who presented to healthcare providers or the county health department 
with acute febrile respiratory illness (AFRI, defined as fever > 100°F plus cough, sore 
throat, or runny nose) had nasopharyngeal swabs and aspirates tested for pandemic 
influenza A(H1N1). Those who tested positive are the index cases. 

(b) The index cases were interviewed by telephone and asked to report AFRI episodes 
among household contacts, including the dates of illness onset. Additional AFRI 
episodes among household contacts were ascertained during follow-up interviews 14 
days after the illness onset of the index case. All cases of AFRI in the household after 
10 days prior to symptom onset in the index case were assumed to be pHlNl cases. 
The earliest pHlNl case in each household is the primary case; this is usually (but 
not always) the index case. 

There were 58 households with a total of 299 members. In these households, there were 
62 primary cases (four households had co-primary cases) and 35 secondary cases. In 51 of 
the 58 households, the index case was a primary case. This is a good example of household 
surveillance data from the early stages of an epidemic. 

As in the simulations, our goal is to estimate the cumulative hazard function of the 
contact interval distribution. We compare the marginal Nelson- Aalen estimate with para- 
metric estimates obtained using the methods from Kenah (2010). We use the corresponding 
marginal Kaplan-Meier estimate to estimate the probability that an infected person makes 
infectious contact with a given household member during his or her infectious period, which 
we call the household infectious contact probability. 

Our natural history assumptions are adapted from Yang et al. (2009). In the primary 
analysis, we assumed an incubation period of 2 days, a latent period of days, and an 
infectious period of 6 days. This means that a person i with onset of symptoms on day i^ ym 
is infected on day ti = i- ym — 2, has onset of infectiousness on day ti, and recovers from 
infectiousness on day ti + 6. In discrete time, the first day on which i is able to infect other 
persons is the day after his or her onset of infectiousness, which is ti + 1 = tf m — 1 under our 
assumptions. If day is the infection time, then we have the first secondary transmissions 
on day 1, the onset of symptoms on day 2, and recovery on day 6. In a sensitivity analysis, 
we vary the incubation period from 1 day to 3 days, the infectious period from 5 days to 7 
days, and the latent period from days to 1 day. 

4.1. Results 

Figure 5 shows the marginal Nelson- Aalen estimate of the cumulative hazard function of the 
contact interval distribution, along with approximate 95% confidence limits and parametric 
estimates assuming exponential and Weibull contact interval distributions. A cumulative 
hazard estimate assuming a gamma contact interval distribution (not shown) was almost 
exactly the same as that assuming a Weibull distribution. 

The exponential, Weibull, and marginal Nelson- Aalen point estimates of the cumulative 
hazard at 6 days post-infection are .0687, .0692, and .0729, respectively. The corresponding 



1 6 Kenah 



survival probabilities are .9334, .9331, and .9296. Thus, all three estimates indicate a 
household infectious contact probability of .07 over the course of the infectious period. 
The approximate 95% confidence interval for the marginal Nelson- Aalen estimate is (.0503, 
.1056). The corresponding 95% confidence interval for the marginal Kaplan-Meier estimate 
of the survival probability is (.8997, .9509). Therefore, our nonparametric estimate of the 
household infectious contact probability is .07 (.05, .10). 

While the nonparametric and parametric estimates agree on the household infectious 
contact probability, they differ in their estimates of distribution of infectiousness over time. 
The exponential estimate inherently predicts a constant hazard of infectious contact over 
the entire infectious period. The Weibull estimate also predicts little variation in the hazard 
of infectious contact over the infectious period, with slightly lower infectiousness near the 
beginning and slightly higher infectiousness near the end. The marginal Nelson- Aalen curve 
has much larger jumps on days 1, 2, and 3 following infection than on days 5 and 6, which 
places the highest infectiousness on the day prior to, the day of, and the day after the onset 
of symptoms. According to this estimate, only 12% of infectious contacts occur > 2 days 
after the onset of symptoms and only 6% occur > 3 days after symptom onset. 

4.2. Sensitivity analysis 

Figure 6 shows the results of a sensitivity analysis in which we vary the assumed (a) incu- 
bation period, (b) latent period, and (c) infectious period. In panel (d), we vary all three of 
these intervals simultaneously to generate the greatest possible variation. The results are 
insensitive to the incubation period and infectious period but sensitive to the latent period. 
However, there is good evidence that little or no transmission occurs two or more days prior 
to the onset of symptoms (Donnelly et al., 2011). 

4.3. Previous pH1N1 household data analyses 

Our nonparametric estimate of .07 (.05, .10) for the household infectious contact probability 
is very similar to the .06 (.03, .11) estimated using data collected by Public Health-Seattle 
and King County during April-May, 2009 (Sugimoto et al., 2011). This estimate was ob- 
tained using a chain-binomial model to estimate the probability of infectious contact with 
a given household member on each day of the infectious period and then calculating the 
probability that infectious contact occurs at least once, making it directly comparable with 
our estimate. The relationship between contact intervals and chain binomial models is 
explained in Appendix B. 

It is more difficult to compare our estimated household infectious contact probability 
with estimates of the household secondary attack rate (SAR), which are usually obtained 
by calculating the proportion of all susceptible members of study households who have the 
onset of a given set of symptoms within a given time period relative to symptom onset 
in the index case. Among these estimates for pHlNl are: .145 (.129, .164) for influenza- 
like illness (ILI) within 7 days after the index case symptom onset (Carcione et al., 2011), 
.13 for acute respiratory infection (ARI) and .9 for ILI within 1-9 days after the index case 
symptom onset (Morgan et al., 2010), .113 (.088, .137) for ILI up to 23 days after index case 
symptom onset (France et al., 2010), and .13 for ARI and .10 for ILI 7 days before or after 
the index case symptom onset (Cauchemez et al., 2009). We did the following simulations 
to see if our estimated household infectious contact probability of .07 is consistent with 
these estimates: 
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(a) For each of the 58 households in the LA data, we simulated an epidemic using its 
observed number of index cases and an household infectious contact probability of 
.07. 

(b) After the epidemics in all households completed, we calculated the proportion of all 
household contacts who were infected. 

Repeating this 10,000 times, we got a mean household SAR of .12 with a bootstrap per- 
centile 95% confidence interval of (.06, .19), which places us solidly within the range of 
previous household SAR estimates. 

Our nonparametric estimate of the distribution of infectiousness over time is consistent 
with estimates of the serial interval distribution for pHlNl by Donnelly et al. (2011), who 
found high infectiousness on days 0, 1, and 2 following symptom onset with only 18% of 
transmission occurring > 2 days after the onset of symptoms and only 5% occurring > 3 
days after symptom onset. A similar pattern of high infectiousness early in the infectious 
period was found by Cauchemez et al. (2009), who also used a serial interval distribution. 
Though theoretical and practical problems with the serial interval distribution were noted 
in Section 1, the similarity between our results and those of Donnelly et al. (2011) and 
Cauchemez et al. (2009) suggests that the marginal Nelson-Aalen estimate captured an 
important feature of pHlNl transmission that was missed by the parametric contact interval 
distribution estimates. 

5. Discussion 

The marginal Nelson- Aalcn estimator suffers many of the same limitations as the parametric 
estimators in Kenah (2010) and some of its own. In roughly decreasing order of difficulty, 
these include: 

(a) The SEIR framework is limited to acute, immunizing infectious diseases that spread 
person-to-person, so our methods do not apply directly to tuberculosis, HIV/AIDS, 
many foodborne and waterborne diseases, pneumococcal and meningococcal diseases, 
and other infectious diseases of major public health importance. 

(b) We have assumed that contact intervals are independent of infectious periods, which 
is unrealistic. Both are affected by the interaction between the host and the pathogen. 

(c) We have assumed that infection times, times of onset of infectiousness, and times of 
recovery from infectiousness are observed, which is unrealistic. Usually, only symptom 
onset times are observed. In the analysis of the Los Angeles County household data in 
Section 4, we were forced to assume constant latent, incubation, and infectious periods 
because we had only symptom onset times. The implicit assumption of constant latent, 
incubation, and infectious periods in methods based on generation and serial intervals 
was one of our main criticisms in Section 1. Unfortunately, our analysis merely made 
these same assumptions explicit. 

(d) We assumed that the contact interval distribution is identically distributed for all 
ordered pairs ij such that Cy = 1, which is unrealistic. For instance, age was found 
to have an effect on pHlNl transmission in most of the papers cited in Section 4 
(Cauchemez et al., 2009; France et al., 2010; Morgan et al., 2010; Carcione et al., 
2011; Sugimoto et al., 2011). The effects of covariates on the transmission of disease 
is a central concern of infectious disease epidemiology. 
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(e) There are several technical issues that need further exploration. A more rigorous 
theoretical analysis of convergence of the marginal Nelson- Aalen estimator, including 
the smoothing step, is needed to better understand the conditions under which it will 
work. Convergence criteria need to be more carefully examined, confidence intervals 
for small samples need to be developed, and more general stopping times for the end 
of observation need to be considered. 

Nonetheless, survival analysis based on contact intervals is a powerful framework for the 
development of statistical methods in infectious disease epidemiology. These methods have 
important theoretical and practical advantages over methods based on generation or serial 
intervals, including greater flexibility in the choice of an underlying transmission model, 
greater clarity about data requirements and analytical assumptions, the exploitation of 
information contributed by uninfected person-time, and validity throughout the course of 
an epidemic. The theory and methods of survival analysis offer many possibilities for 
overcoming the limitations listed above. 

The arguments for convergence of the Nelson-Aalen estimator given in Section 2 are 
largely heuristic, but the theoretical analysis required to address limitation (c) is compli- 
cated by the smoothing step. In simulations, the performance of the marginal Nelson-Aalen 
estimator did not appear sensitive to the smoothing method. However, a smoother designed 
specifically for hazard functions, such as that of Miiller and Wang (1994), may improve 
performance under certain conditions. Another technical limitation is that the variance 
estimates based on equation (40) treat the estimated transmission network probabilities 
as known, ignoring a component of the variance that may be important in small samples. 
These and other technical issues need to be resolved with more rigorous theoretical analysis 
and further simulation studies. 

Limitation (d) can be addressed by incorporating the marginal Nelson-Aalen estimate 
into a semiparametric regression model for the effects of covariates on the hazard of in- 
fectious contact. Suppose the hazard of infectious contact from person i to person j at 
infectiousness age r of % is 

Xtf(r) - Ao(T)exp03 T A%-), (62) 

where /3 is a parameter vector and Xij is a vector of infectiousness covariates for i, suscep- 
tibility covariates for j, and pairwise covariates for ij. The methods of Section 2 could be 
extended to nonparametrically estimate Ao(r) by adding a step to the EM algorithm that 
maximized the expected log likelihood over [3. This would yield a semiparametric estimator 
of the effects of covariates on infectiousness and susceptibility that is similar in spirit to 
the Cox proportional hazards model. In infectious disease epidemiology, these covariate 
effects are usually a higher priority than the distribution of infectiousness over time, so 
the marginal Nelson-Aalen estimator may find its most important applications behind the 
scenes of this regression model. 

Limitation (c) could be addressed by adopting a semiparametric Bayesian framework 
(Sinha and Dey, 1997) or using a profile sampler (Lee ct al., 2005). A semiparametric 
Bayesian approach would use a prior process instead of a marginal Nelson-Aalen estimator. 
A profile sampler would take advantage of the fact that the marginal Nelson-Aalen estimate 
is an efficient numerical method of obtaining a likelihood maximized over all possible contact 
interval distributions for a given set of infection times, infectiousness onset times, and 
recovery times. 

Limitation (b) could be addressed by using multivariate survival methods to estimate the 
joint distribution of the contact interval and the infectious period, and limitation (a) could 
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be addressed by allowing individuals to experience multiple infection events or infection 
events of different types (e.g., new infection, carriage, relapse). 

Survival analysis based on contact intervals is a promising approach to the statistical 
analysis of infectious disease data. The marginal Nelson- Aalen estimator extends a beauti- 
ful and powerful set of nonparametric methods from standard survival analysis to infectious 
disease epidemiology. The simulations in Section 3 show that these methods are reliable, 
and the data analysis in Section 4 shows that the nonparametric methods in this paper are 
an important extension of the parametric methods in Kenah (2010). Each of the ingredi- 
ents of this approach have been applied to infectious diseases before: counting processes 
and martingales by Becker (1989) and Rhodes et al. (1996), the EM algorithm by Becker 
(1997), and summation over possible transmission networks by Wallinga and Teunis (2004). 
The marginal Nelson- Aalen estimator combines these in a novel and useful way, correcting 
approaches based on generation and serial intervals, generalizing the chain binomial model, 
and pointing the way to more flexible and practical statistical methods for infectious disease 
epidemiology. 
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A. Partially-observed transmission networks 

In a partially-observed transmission network, we observe the infectors of a subset of all in- 
fected persons. By taking Vj — Vj for all j such that the infector is observed, equations (28) 
through (33) define a marginal Nelson- Aalen estimator that takes this additional informa- 
tion into account. For each j with an observed Vj, N*(t) = N Vj j(r). The variance estimate 
in equation (43) can be used to obtain confidence intervals via equation (44). 

B. Contact intervals and chain binomial models 

In this appendix, we show that the methods of Section 2 reproduce classical chain-binomial 
models when the contact interval distribution is discrete. In the case of a continuous contact 
interval distribution (Kenah et al., 2008; Kenah, 2010), the likelihood when who-infected- 
whom is not observed is 

L ^ = II [E A ^^- £ *)] II S^A^-U-e,])^, (63) 

j:0<tj<T ieVj i:ti<tj 

where A(r) = . If we take A(t) = S ^ T , this is equivalent to the integrated 

likelihood maximized by the EM algorithm in Section 2. 

When the contact interval and latent period distributions are discrete, we must account 
for the possibility of tied infectious contact times. Instead of each infected person j having 
a single infector, the source of his or her infection can be any nonempty subset Vj C Vj. 
Therefore, the hazard term for j in the likelihood becomes 

e n ^ - ** - ^ n t 1 A fe - ** - e *)] ■ ( 64 ) 

where 

A(r) = SiT 2;" iT) (65) 
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is the conditional probability of infectious contact at r given no infectious contact before r. 
Equation (64) contains all terms in 

J[ [1 - \{tj -U- e< )] + X(tj - U - Si) = 1 (66) 

except 

II [l-Afe-i, -eO], (67) 

so the likelihood for a discrete contact interval distribution can be written 

L(S) = [] (l- J] [1-Afo-ti-eO]) J] S^Afe-ti-eJ) ". (68) 

j:0<t i <T iGVj i:t 4 <% 

Since 1 — A(t) can be interpreted as an infection escape probability, the likelihood in equa- 
tion (68) is precisely the same as that of a chain-binomial model (Rampey et al., 1992). 
Therefore, chain-binomial models can be defined in terms of contact interval distributions, 
and the methods of this paper can be seen as an extension of chain-binomial models to 
continuous time. 
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Table 1. 95% confidence interval coverage probabilities in network-based simulations. 



Data 




Contact interval distribution 






percentile 


Weibull(.5, 1) 


Weibull(2, 1) 






Nelson- Aalen 


marginal NA 


Nelson- Aalen 


marginal NA 


5 


.949 (.933, .962) 


.946 (.930, .959) 


.960 (.946, .971) 


.903 (.883, 


.921) 


10 


.954 (.939, .966) 


.952 (.937, .964) 


.963 (.949, .974) 


.901 (.881, 


.919) 


25 


.957 (.943, .969) 


.947 (.931, .960) 


.952 (.937, .964) 


.929 (.911, 


.944) 


50 


.952 (.937, .964) 


.940 (.923, .954) 


.944 (.928, .957) 


.924 (.906, 


.940) 


75 


.948 (.932, .961) 


.944 (.928, .957) 


.958 (.944, .970) 


.958 (.944, .970) 


90 


.935 (.918, .949) 


.944 (.928, .957) 


.956 (.941, .968) 


.939 (.922, 


.953) 


95 


.934 (.917, .949) 


.938 (.921, .952) 


.942 (.926, .956) 


.935 (.918, 


.949) 




Kaplan-Meier 


marginal KM 


Kaplan-Meier 


marginal KM 


5 


.949 (.933, .962) 


.946 (.930, .959) 


.960 (.946, .971) 


.903 (.883, 


.921) 


10 


.954 (.939, .966) 


.952 (.937, .964) 


.964 (.951, .975) 


.901 (.881, 


.919) 


25 


.957 (.943, .969) 


.947 (.931, .960) 


.952 (.937, .964) 


.929 (.911, 


.944) 


50 


.952 (.937, .964) 


.940 (.923, .954) 


.944 (.928, .957) 


.925 (.907, 


.941) 


75 


.948 (.932, .961) 


.945 (.929, .958) 


.960 (.946, .971) 


.957 (.942, 


.969) 


90 


.934 (.917, .949) 


.944 (.928, .957) 


.955 (.940, .967) 


.940 (.923, 


.954) 


95 


.933 (.916, .948) 


.938 (.921, .952) 


.943 (.927, .957) 


.936 (.919, .950) 



Table 2. 95% confidence interval coverage probabilities for mass-action simulations. 



Data 
percentile 

5 
10 

25 
50 
75 
90 
95 

5 
10 

25 
50 
75 
90 
95 



Scaled contact interval distribution 



Nelson 
.949 (.933 
.957 (.943 
.953 (.938 
.951 (.936 
.934 (.917 
.932 (.915 
.941 (.925 



Weibull(.5, 5) 
Aalen marginal NA 

003 (.001, .009) 

004 (.001, .010) 
011 (.006, .020) 
056 (.043, .072) 
842 (.818, .864) 
091 (.074, .111) 
035 (.024, .048) 



, .962) 
969) 
965) 
964) 
949) 
, .947) 
, .955) 



Kaplan-Meier 
.950 (.935, .963) 
.958 (.944, 
.953 (.938, 
.951 (.936, 
.933 (.916, 
.930 (.912, 
.941 (.925, 



.970) 
.965) 
.964) 
.948) 
.945) 
.955) 



marginal KM 
.003 (.001, .009) 
.004 (.001, .010) 
.011 (.006, .020) 
.056 (.043, .072) 
.842 (.818, .864) 
.091 (.074, .111) 
.035 (.024, .048) 



Weibull(2, 1) 



Nelson- Aalen 

.689 (.659, .718) 

.960 (.946, .971) 

.963 (.949, .974) 

.944 (.928, .957) 

.963 (.949, .970) 

.959 (.945, .970) 

.945 (.929, .958) 

Kaplan-Meier 

.689 (.659, .718) 

.960 (.946, .971) 

.963 (.949, .974) 

.944 (.928, .957) 

.963 (.949, .974) 

.959 (.945, .970) 

.938 (.921, .952) 



marginal NA 
(0, .004) 
(0, .004) 
(0, .004) 
(0, .004) 
.001 (.000, .006) 
.178 (455, .203) 
.004 (.001, .010) 

marginal KM 
(0, .004) 
(0, .004) 
(0, .004) 
(0, .004) 
.001 (.000, .006) 
.178 (.155, .203) 
.004 (.001, .010) 
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Fig. 1. Estimates of the cumulative hazard and survival functions for network-based models with a 
Weibull(.5, 1) contact interval distribution, with (left) and without (right) data on who-infected-whom. 
The true cumulative hazard function (top) and survival function (bottom) are indicated with dashed 
lines. The point clouds are point estimates at the 5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles 
of the possible contact intervals in each of 1 ,000 simulations. 
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Nelson-Aalen estimates 
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Fig. 2. Estimates of the cumulative hazard and survival functions for network-based models with a 
Weibull(2, 1) contact interval distribution, with (left) and without (right) data on who-infected-whom. 
The true cumulative hazard function (top) and survival function (bottom) are indicated with dashed 
lines. The point clouds are point estimates at the 5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles 
of the possible contact intervals in each of 1 ,000 simulations. 
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Nelson-Aalen estimates 



Marginal Nelson-Aalen estimates 
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Fig. 3. Estimates of the cumulative hazard and survival functions for mass-action models with a 
Weibull(.5, 5) normalized contact interval distribution using the methods from Section 2.4, with (left) 
and without (right) data on who-infected-whom. The true cumulative hazard function (top) and sur- 
vival function (bottom) are indicated with dashed lines. The point clouds are point estimates at the 
5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles of the possible contact intervals in each of 1 ,000 
simulations. Note the deviation of the marginal Nelson-Aalen and Kaplan-Meier estimates from the 
true functions. 
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Nelson-Aalen estimates 



Marginal Nelson-Aalen estimates 
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Fig. 4. Estimates of the cumulative hazard and survival functions for mass-action models with a 
Weibull(2, 1) normalized contact interval distribution using the methods from Section 2.4, with (left) 
and without (right) data on who-infected-whom. The true cumulative hazard function (top) and sur- 
vival function (bottom) are indicated with dashed lines. The point clouds are point estimates at the 
5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles of the possible contact intervals in each of 1 ,000 
simulations. Note the deviation of the marginal Nelson-Aalen and Kaplan-Meier estimates from the 
true functions. 
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Influenza A(H1N1) contact intervals in LA county 



CO 
N 
CO 
-£= 

CD 
> 

E 
O 



oo 
o 



CD 
O 

d 



o 



C\J 

o 



o 
o 



~r 





Marginal Nelson-Aalen estimate 

95% confidence limits 
Exponential estimate 

Weibull estimate 



T 



Days since onset of infectiousness 



Fig. 5. A marginal Nelson-Aalen estimate of the contact interval cumulative hazard function based 
on the Los Angeles County household data. For comparison, we have parametric estimates of 
the cumulative hazard function assuming exponential and Weibull contact interval distributions. The 
assumed latent period for all analyses was 2 days, so symptom onset occurs on day 2 after the onset 
of infectiousness. Note the large jumps on days 1 , 2, and 3 following the onset of infectiousness and 
the small jumps on days 5 and 6. 
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Fig. 6. Sensitivity analyses for the nonparametric analysis of the LA county household data. There 
is little sensitivity to the assumed incubation and infectious periods. There is greater sensitivity to 
the assumed latent period, but there is good evidence for transmission on the day prior to symptom 
onset and little evidence of significant transmission > 2 days prior to symptom onset (Cauchemez 
et al., 2009; Donnelly et al., 201 1 ). 



